Jeremy James
Home Credit is an international consumer finance provider that lends money primarily to those with little or no credit history. They created a Kaggle competition in which users applied machine learning and statistical methods to predict the loan default risk of individuals based on Home Credit's loan applicant data. Although Home Credit had already used machine learning to project default risk, they hoped to find ways to improve their predictive ability based on what the Kagglers produced.
Home Credit serves those who lack a credit history or are unbanked, and who are therefore likely to be viewed as high risk for loan default, even if they are financially responsible and will make the necessary repayments. Home Credit needs to avoid lending to those who will end up unable to complete payments, as a single default can cost more than the gain from several successful loans. They also want to make their services as accessible as possible by providing loans to all those who are truly eligible. Machine learning can help manage these risks and rewards. By building an accurate model, we give Home Credit the ability to provide more loans, since applicants our model identifies as lower risk will be less likely to default.
The main research question is whether we can create a model that can estimate the probability of a loan applicant defaulting. Another goal would be to find the features that best predict the default risk.
The data was retrieved from the Home Credit Default Risk Prediction competition on kaggle.com (https://www.kaggle.com/c/home-credit-default-risk/data). The data and supporting information are contained in 10 files totaling 2.68 GB.
import numpy as np
import pandas as pd
The training data is in application_train.csv, which contains a total of 307,511 rows.
data = pd.read_csv('./DATA/application_train.csv')
len(data)
307511
Below is a view of the first 5 rows:
data.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
There are 122 columns in total, so the view above shows only a fraction of them.
len(data.columns)
122
56 of these are categorical, while the rest are numerical.
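This split can be checked directly, since text columns load with pandas' `object` dtype. A minimal sketch on a toy frame (the real check would run on `data`):

```python
import pandas as pd

# Toy frame standing in for `data`; the real check runs on application_train.csv
df = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans'],
    'FLAG_OWN_CAR': ['N', 'Y'],
    'AMT_CREDIT': [406597.5, 135000.0],
})

# Columns pandas read as strings are the categorical ones
cat_cols = df.select_dtypes(include='object').columns
num_cols = df.select_dtypes(exclude='object').columns
print(len(cat_cols), len(num_cols))  # 2 1
```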
I will also be using bureau.csv and bureau_balance.csv, files containing data on the applicant's existing and past loan balances at other institutions. The data I am interested in is in bureau_balance.csv; while bureau.csv has several columns, I am only interested in the two ID columns it contains. These will allow me to match the data in application_train.csv to bureau_balance.csv. Here is a view of the first 5 rows of bureau_balance:
bureau_data = pd.read_csv('./DATA/bureau_balance.csv')
bureau_data.head()
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
This is a large dataset, as it contains a monthly record for every existing and closed loan held by the applicants.
len(bureau_data)
27299925
Our target variable is whether the loan applicant defaulted. It is a boolean value: 0 represents no default, and 1 signals an applicant who ended up defaulting. The target is imbalanced, as there are almost 10 times more successful loans than defaults.
import plotly
import plotly.express as px
plotly.offline.init_notebook_mode()
px.histogram(x="TARGET", data_frame=data, labels={'TARGET':'Loan Default'})
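The imbalance can also be quantified without a plot. A small sketch on a toy target series with an assumed ~11:1 split standing in for `data['TARGET']`:

```python
import pandas as pd

# Toy target with an assumed ~11:1 split standing in for data['TARGET']
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(imbalance_ratio)  # 11.5
```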
Several variables contain information about the individual applying, including demographics like gender or education status, or miscellaneous information like whether the person owns a home or car. These variables are mainly boolean or categorical.
This group of variables describe the loan itself, such as the amount of money being borrowed, the time the approval process began, and what documentation was provided. These variables are of all types, including numerical.
These variables contain the financial information of the applicant. They include normalized scores (I believe these scores act like a credit score, but Home Credit does not state this in the column description they provided) and the number of enquiries made to the Credit Bureau about the client prior to the application. These variables are mainly numerical.
Home Credit has provided columns containing the applicant's annual income and field of work. These variables are of all types.
A fair amount of variables contain data about the property the client lives in. These are numeric, and have been normalized.
This group of variables describe the region where the client lives, including the normalized population and Home Credit's rating of the region.
Most columns are missing less than 1% of their data, outside of the property info columns. Many of the property info columns are missing the majority of their values, which may prevent me from using their values. Other variables with a large share of missing values are the applicant's car age, occupation type, two of their scores (normalized financial scores), and the Credit Bureau enquiry counts.
Records with null property values make up 75% of the data. Almost every record has a null value when you also include the other frequently missing columns. However, ignoring these columns, only 1% of our data would have a null value.
prop_cols = [col for col in data.columns if 'AVG' in str(col) or 'MEDI' in str(col) or 'MODE' in str(col)]
add_mis_cols = ['OWN_CAR_AGE','OCCUPATION_TYPE','EXT_SOURCE_1','EXT_SOURCE_3']
add_mis_cols.extend([col for col in data.columns if 'AMT_REQ_CREDIT' in str(col)])
other_cols = [col for col in data.columns if str(col) not in prop_cols and str(col) not in add_mis_cols]
len(data['DAYS_BIRTH'].dropna())
307511
len(data[prop_cols].dropna())/len(data)
0.2555843530800524
len(data[prop_cols+add_mis_cols].dropna())/len(data)
0.028197365297501553
len(data[other_cols].dropna())/len(data)
0.9903092897489846
I am planning to drop the property columns from the dataset, and may revisit them later to do some additional feature engineering and data cleansing on them. I will also ignore the applicant's car age, as 2/3 of its values are missing. The remaining columns are missing significantly fewer values.
For categorical columns with missing values, I will either choose 1 category as the default category (a category that shouldn't have effect on default probability), or create a new "unknown" category. For numeric columns, I will use KNN imputation with 5 neighbors to fill the values or choose a default value. In order to do the imputation, I selected a subset of columns to reduce computation time and scaled the values. With these changes, I no longer have any missing values in my training dataset.
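A minimal sketch of the numeric imputation step, using scikit-learn's `StandardScaler` and `KNNImputer` on a toy two-column frame (2 neighbors here instead of 5, since the toy frame has only 5 rows):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy numeric frame with gaps; the report used 5 neighbors on a larger column subset
df = pd.DataFrame({
    'AMT_INCOME_TOTAL': [202500.0, 270000.0, np.nan, 135000.0, 121500.0],
    'AMT_CREDIT': [406597.5, 1293502.5, 135000.0, np.nan, 513000.0],
})

# Scale first so neighbor distances are not dominated by the larger column
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Fill each gap with the mean of its 2 nearest rows in scaled space
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    scaler.inverse_transform(imputer.fit_transform(scaled)),
    columns=df.columns,
)
print(filled.isna().sum().sum())  # 0
```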
Home Credit has also provided a file with past loan balances at other institutions for the loan applicant, titled bureau_balance.csv. I will be aggregating this file by loan to get the count of months where the loan was late on payment. I will then match up these loans to the applicants using the data in bureau.csv. Applicants can have multiple loans, so for each applicant, I will take the total number of defaults across all their loans. Finally, I will add the total default count as a new variable in our training dataset by doing a left join on the applicant ID, where the left dataset is our original training dataset.
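The aggregation and joins described above can be sketched with toy stand-ins for the three files (assuming a `STATUS` of '1' through '5' marks a late month, per the column description):

```python
import pandas as pd

# Toy stand-ins for bureau_balance.csv, bureau.csv, and application_train.csv
bureau_balance = pd.DataFrame({
    'SK_ID_BUREAU': [1, 1, 1, 2, 2],
    'STATUS': ['C', '1', '0', 'C', 'C'],
})
bureau = pd.DataFrame({'SK_ID_CURR': [100, 100, 200], 'SK_ID_BUREAU': [1, 2, 3]})
applications = pd.DataFrame({'SK_ID_CURR': [100, 200, 300]})

# Count late months per loan (assuming STATUS '1'-'5' marks days past due)
late = bureau_balance['STATUS'].isin(list('12345'))
per_loan = late.groupby(bureau_balance['SK_ID_BUREAU']).sum().rename('DEFAULT_COUNT')

# Map loans to applicants via bureau.csv and total across each applicant's loans
per_applicant = (bureau.join(per_loan, on='SK_ID_BUREAU')
                       .groupby('SK_ID_CURR', as_index=False)['DEFAULT_COUNT'].sum())

# Left join onto the training data, with 0 for applicants with no bureau history
result = applications.merge(per_applicant, on='SK_ID_CURR', how='left')
result['DEFAULT_COUNT'] = result['DEFAULT_COUNT'].fillna(0)
print(result['DEFAULT_COUNT'].tolist())  # [1.0, 0.0, 0.0]
```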
So far, I have created 3 new variables. The first two are the social circle default ratio columns. These 2 columns are based on 4 other columns in the original dataset that contain the number of observations and the number of defaults in the client's social circle. By creating a ratio, we can capture the true propensity of defaults in the client's social circle, instead of only going off the raw default count. These values are generated for both the 30-day default and the 60-day default.
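A sketch of the ratio construction on toy data, using the original 30-day column names; mapping a zero-observation circle to a 0 ratio is an assumed choice:

```python
import pandas as pd

# Toy columns using the original dataset's names for the 30-day social circle
df = pd.DataFrame({
    'OBS_30_CNT_SOCIAL_CIRCLE': [10.0, 4.0, 0.0],
    'DEF_30_CNT_SOCIAL_CIRCLE': [2.0, 0.0, 0.0],
})

# Defaults as a share of observations; 0/0 becomes NaN, mapped to a 0 ratio
df['DEF_30_RATIO_SOCIAL_CIRCLE'] = (
    df['DEF_30_CNT_SOCIAL_CIRCLE']
    .div(df['OBS_30_CNT_SOCIAL_CIRCLE'])
    .fillna(0)
)
print(df['DEF_30_RATIO_SOCIAL_CIRCLE'].tolist())  # [0.2, 0.0, 0.0]
```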
The other new column is the number of requests to the Credit Bureau over the year prior to the application. In the original table, the total number of requests is split between the last hour, day, week, month, quarter, and year, with each column excluding the observations counted in another. For example, the year request count excludes the count from the last quarter. I will be adding these columns together for one final annual total, as I don't believe we have enough observations in the smaller periods for them to be useful. Finally, the column from the bureau aggregation will also be added to the training dataset.
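Since the windows are disjoint, the annual total is a plain row sum across the six columns. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame with the six disjoint enquiry-count windows
df = pd.DataFrame({
    'AMT_REQ_CREDIT_BUREAU_HOUR': [0.0, 0.0],
    'AMT_REQ_CREDIT_BUREAU_DAY': [0.0, 1.0],
    'AMT_REQ_CREDIT_BUREAU_WEEK': [0.0, 0.0],
    'AMT_REQ_CREDIT_BUREAU_MON': [0.0, 2.0],
    'AMT_REQ_CREDIT_BUREAU_QRT': [0.0, 0.0],
    'AMT_REQ_CREDIT_BUREAU_YEAR': [1.0, 3.0],
})

# Because each window excludes the others, the annual total is a row sum
req_cols = [c for c in df.columns if c.startswith('AMT_REQ_CREDIT_BUREAU_')]
df['AMT_REQ_CREDIT_BUREAU'] = df[req_cols].sum(axis=1)
print(df['AMT_REQ_CREDIT_BUREAU'].tolist())  # [1.0, 6.0]
```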
After our column drops and additions, we have 78 columns in our cleaned data, 52 of which are categorical and 26 numerical. Below is a view of the first 5 rows of this data.
final_dataset = pd.read_csv('./DATA/final_dataset.csv')
final_dataset.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | DEF_30_RATIO_SOCIAL_CIRCLE | DEF_60_RATIO_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU | DEFAULT_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 27.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
len(final_dataset.columns)
78
# Find columns with missing values
final_dataset.columns[final_dataset.isna().any()].tolist()
[]
InterpretML. "Explainable Boosting Machine". https://interpret.ml/docs/ebm.html
Harsha Nori et al. "InterpretML: A Unified Framework for Machine Learning Interpretability". https://arxiv.org/pdf/1909.09223.pdf
Yin Lou et al. "Intelligible Models for Classification and Regression". https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf
Explainable Boosting Machine (EBM) is a generalized additive model (GAM), having the form:
$$g(E[y]) = \beta_{0} + \sum_{j} f_{j}(x_{j})$$
There are two major differences between EBM and a standard GAM. First, each feature function $f_{j}$ is tuned using techniques like bagging and gradient boosting. Second, automatic pairwise interaction detection is supported, so the EBM can be represented as:
$$g(E[y]) = \beta_{0} + \sum_{i} f_{i}(x_{i}) + \sum_{i,j} f_{i,j}(x_{i},x_{j})$$
A full description of the algorithm can be found in "Intelligible Models for Classification and Regression".
EBM can be found in InterpretML, a package containing several machine learning interpretability methods. I will be using the Python version. InterpretML was developed at Microsoft Research and made open source. EBM is an implementation of the algorithm proposed by Lou et al. ("Intelligible Models for Classification and Regression"), and follows scikit-learn's API. As part of its "interpretable" nature, visuals revealing the model's structure or explaining its output are easily created. Example code can be found in the "Model Results" section of this report.
Chen et al. "Using Random Forest to Learn Imbalanced Data". https://unomaha.instructure.com/courses/51016/pages/teaching-presentation
Leo Breiman. "Random Forests". https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf
A random forest classifier consists of multiple trees, all providing predictions given an input. Each of these trees is trained using a random subset of variables and observations from the original input. When dealing with imbalanced data, random forests will often ignore the minority class, as the standard error/objective functions provide little penalty for misclassified minority class observations because there are so few of them. One way to prevent this is applying a heavier penalty for misclassifying the minority class, which can be done by applying a large weight to it. The normal Gini function (the function that calculates the impurity of a tree node) looks like this:
$$1 - [(Ratio_{0})^2 + (Ratio_{1})^2]$$
A weighted version looks like this:
$$1 - [w_{0}(Ratio_{0})^2 + w_{1}(Ratio_{1})^2]$$
I used the implementation of the Random Forest Classifier in Python's scikit-learn package. scikit-learn comes with many different machine learning models, as well as functions to preprocess data and evaluate model results. Due to scikit-learn's well-known API, it was easy to create the model and generate predictions. Example code can be found in the "Model Results" section.
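The weighted impurity formula above can be transcribed directly (this is a literal sketch of the formula, not scikit-learn's internal computation):

```python
# Gini impurity with optional class weights, transcribed from the formula above
def weighted_gini(ratio0, ratio1, w0=1.0, w1=1.0):
    return 1 - (w0 * ratio0 ** 2 + w1 * ratio1 ** 2)

# With unit weights this is the standard Gini: pure node -> 0, 50/50 split -> 0.5
print(weighted_gini(1.0, 0.0))  # 0.0
print(weighted_gini(0.5, 0.5))  # 0.5

# A heavier weight on the minority class changes how candidate splits are scored
print(weighted_gini(0.9, 0.1, w0=1.0, w1=10.0))
```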
# helper function: buckets x into the given bins and returns the mean of y per bucket
def binner(x, y, bins):
    bin_labels = []
    bin_values = []
    for i in range(len(bins)+1):
        if i == 0:
            label = "0-" + str(bins[i])
            tf1 = x >= 0
            tf2 = x < bins[i]
            tf = tf1 & tf2
            val = y[tf].mean()
        elif i == len(bins):
            label = str(bins[i-1]) + "+"
            tf = x >= bins[i-1]
            val = y[tf].mean()
        else:
            label = str(bins[i-1]) + "-" + str(bins[i])
            tf1 = x >= bins[i-1]
            tf2 = x < bins[i]
            tf = tf1 & tf2
            val = y[tf].mean()
        bin_labels.append(label)
        bin_values.append(val)
    return bin_labels, bin_values
We have 78 columns in our final dataset. I have selected a subset of these that showcase the kinds of relationships the independent variables have with our target variable.
The score variables (column names EXT_SOURCE_1, 2, and 3) have the most clear relationship with the default count. The below graph displays the default rate for different buckets of the second score. As the score increases, defaults become less common.
labels, vals = binner(final_dataset['EXT_SOURCE_2'], final_dataset['TARGET'], [v/10 for v in range(1,8,1)])
px.bar(x=labels, y=vals, labels={'x':'Score 2 Bucket', 'y':'Default Rate'})
Unfortunately, most variables do not share the strong relationship the score variables have with the default rate. But differences in the default rate can be seen, as one can find in the plots below.
labels, vals = binner(final_dataset['AMT_INCOME_TOTAL'], final_dataset['TARGET'], [v for v in range(75000,250000,50000)])
px.bar(x=labels, y=vals, labels={'x':'Income Bucket', 'y':'Default Rate'})
labels, vals = binner(final_dataset['AMT_CREDIT'], final_dataset['TARGET'], [v for v in range(200000,1000000,200000)])
px.bar(x=labels, y=vals, labels={'x':'Loan Amount Bucket', 'y':'Default Rate'})
labels, vals = binner(final_dataset['DEFAULT_COUNT'], final_dataset['TARGET'], [v for v in range(1,5,1)])
px.bar(x=['0','1','2','3','4+'], y=vals, labels={'x':'Number of past loan defaults', 'y':'Default Rate'})
Below is an example of a variable I expected to be related to the default rate, but that did not appear to be.
labels, vals = binner(final_dataset['AMT_REQ_CREDIT_BUREAU'], final_dataset['TARGET'], [v for v in range(1,9,1)])
px.bar(x=[str(v) for v in range(0,8)]+['8+'], y=vals, labels={'x':'Number of enquiries to Credit Bureau about the client', 'y':'Default Rate'})
Below are some plots with default rates for different categories. The code that creates these plots was mostly generated by mitosheet.
# Import plotly and create a figure
import plotly.graph_objects as go
from mitosheet import flatten_column_header
# Pivoted final_dataset_csv into df2
unused_columns = final_dataset.columns.difference(set(['CODE_GENDER']).union(set([])).union(set({'TARGET'})))
tmp_df = final_dataset.drop(unused_columns, axis=1)
pivot_table = tmp_df.pivot_table(
index=['CODE_GENDER'],
values=['TARGET'],
aggfunc={'TARGET': ['mean']}
)
# Flatten the column headers
pivot_table.columns = [flatten_column_header(col) for col in pivot_table.columns.values]
# Reset the column name and the indexes
df2 = pivot_table.reset_index()
fig = go.Figure()
# Add the bar chart traces to the graph
for column_header in ['CODE_GENDER']:
fig.add_trace(
go.Bar(
x=df2[column_header],
y=df2['TARGET mean'],
name=column_header
)
)
# Update the title and stacking mode of the graph
# See Plotly documentation for customizations: https://plotly.com/python/reference/bar/
fig.update_layout(
xaxis_title='Gender',
yaxis_title='Default Rate',
barmode='group',
)
fig.show()
# Pivoted final_dataset_csv into df4
unused_columns = final_dataset.columns.difference(set(['CNT_CHILDREN']).union(set([])).union(set({'TARGET'})))
tmp_df = final_dataset.drop(unused_columns, axis=1)
pivot_table = tmp_df.pivot_table(
index=['CNT_CHILDREN'],
values=['TARGET'],
aggfunc={'TARGET': ['count', 'mean']}
)
# Flatten the column headers
pivot_table.columns = [flatten_column_header(col) for col in pivot_table.columns.values]
# Reset the column name and the indexes
df4 = pivot_table.reset_index()
# Filtered CNT_CHILDREN in df4
df4 = df4[df4['CNT_CHILDREN'] < 4]
fig = go.Figure()
# Add the bar chart traces to the graph
for column_header in ['CNT_CHILDREN']:
fig.add_trace(
go.Bar(
x=df4[column_header],
y=df4['TARGET mean'],
name=column_header
)
)
# Update the title and stacking mode of the graph
# See Plotly documentation for customizations: https://plotly.com/python/reference/bar/
fig.update_layout(
xaxis_title='Number of Children',
yaxis_title='Default Rate',
xaxis = dict(tickmode = 'linear',
tick0 = 0,
dtick = 1)
)
fig.show()
# Pivoted final_dataset_csv into df5
unused_columns = final_dataset.columns.difference(set(['NAME_EDUCATION_TYPE']).union(set([])).union(set({'TARGET'})))
tmp_df = final_dataset.drop(unused_columns, axis=1)
pivot_table = tmp_df.pivot_table(
index=['NAME_EDUCATION_TYPE'],
values=['TARGET'],
aggfunc={'TARGET': ['count', 'mean']}
)
# Flatten the column headers
pivot_table.columns = [flatten_column_header(col) for col in pivot_table.columns.values]
# Reset the column name and the indexes
df5 = pivot_table.reset_index()
# Filtered TARGET count in df5
df5 = df5[df5['TARGET count'] > 1000]
fig = go.Figure()
# Add the bar chart traces to the graph
for column_header in ['NAME_EDUCATION_TYPE']:
fig.add_trace(
go.Bar(
x=df5[column_header],
y=df5['TARGET mean'],
name=column_header
)
)
# Update the title and stacking mode of the graph
# See Plotly documentation for customizations: https://plotly.com/python/reference/bar/
fig.update_layout(
xaxis_title='Education Level',
yaxis_title='Default Rate',
)
fig.show()
# Pivoted final_dataset_csv into df6
unused_columns = final_dataset.columns.difference(set(['OCCUPATION_TYPE']).union(set([])).union(set({'TARGET'})))
tmp_df = final_dataset.drop(unused_columns, axis=1)
pivot_table = tmp_df.pivot_table(
index=['OCCUPATION_TYPE'],
values=['TARGET'],
aggfunc={'TARGET': ['count', 'mean']}
)
# Flatten the column headers
pivot_table.columns = [flatten_column_header(col) for col in pivot_table.columns.values]
# Reset the column name and the indexes
df6 = pivot_table.reset_index()
# Filtered TARGET count in df6
df6 = df6[df6['TARGET count'] > 3000]
# Sorted TARGET mean in df6 in descending order
df6 = df6.sort_values(by='TARGET mean', ascending=False, na_position='last')
fig = go.Figure()
# Add the bar chart traces to the graph
for column_header in ['OCCUPATION_TYPE']:
fig.add_trace(
go.Bar(
x=df6[column_header],
y=df6['TARGET mean'],
name=column_header
)
)
# Update the title and stacking mode of the graph
# See Plotly documentation for customizations: https://plotly.com/python/reference/bar/
fig.update_layout(
xaxis_title='Occupation Type',
yaxis_title='Default Rate'
)
fig.show()
Below is the code to train a weighted random forest classifier and generate predictions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.preprocessing import OneHotEncoder
# Prework data necessary to fit the random forest model
encoder = OneHotEncoder(handle_unknown='ignore')
train_data = final_dataset.sample(n=200000)
train_data.reset_index(drop=True, inplace=True)
cat_col_idxs = [idx for idx,col_type in enumerate(train_data.dtypes) if col_type=='O']
train_data.iloc[:,cat_col_idxs].head()
encoder_df = pd.DataFrame(encoder.fit_transform(train_data.iloc[:,cat_col_idxs]).toarray())
del_cols = [col for idx,col in enumerate(train_data.columns) if idx in cat_col_idxs]
for col in del_cols:
del train_data[col]
train_data = train_data.join(encoder_df)
x_cols = list(train_data.columns[2:])
X_data = train_data[x_cols]
y_data = train_data['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20)
# Calculate weights
pos_count = y_train.sum()
neg_count = len(y_train) - pos_count
pos_weight = neg_count/len(y_train)
neg_weight = pos_count/len(y_train)
# Fit model
rfc = RandomForestClassifier(n_estimators=1000, class_weight={0:neg_weight, 1:pos_weight})
rfc.fit(X_train, y_train)
# Generate predictions
y_pred = rfc.predict(X_test)
Here are some accuracy metrics:
acc_score = accuracy_score(y_pred,y_test)
print("Accuracy score: " + str(acc_score))
precision, recall, _, _ = precision_recall_fscore_support(y_pred,y_test)
print("Precision: " + str(precision[1]))
print("Recall: " + str(recall[1]))
Accuracy score: 0.9201
Precision: 0.00031279324366593683
Recall: 1.0
The confusion matrix below reveals that the classifier isn't paying attention to the positive class.
confusion_matrix(y_pred,y_test)
array([[36803, 3196],
[ 0, 1]], dtype=int64)
Although I modified the parameters for the random forest model in many different ways, I was unable to raise the precision for the positive label while producing reasonable results.
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
# Data prep
train_data = final_dataset.sample(n=200000)
x_cols = list(train_data.columns[2:])
X_data = train_data[x_cols]
y_data = train_data['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20)
# Fit model
ebm = ExplainableBoostingClassifier()
weights = np.array([pos_weight if ob == 1 else neg_weight for ob in y_train])
ebm.fit(X_train, y_train, sample_weight=weights)
# Generate predictions
y_pred = ebm.predict(X_test)
Here are some accuracy metrics:
acc_score = accuracy_score(y_pred,y_test)
print("Accuracy score: " + str(acc_score))
precision, recall, _, _ = precision_recall_fscore_support(y_pred,y_test)
print("Precision: " + str(precision[1]))
print("Recall: " + str(recall[1]))
Accuracy score: 0.703975
Precision: 0.6668779714738511
Recall: 0.1631766713199938
The precision is much higher than the random forest model's, and the confusion matrix below also reveals that the classifier is paying attention to the positive class. Our accuracy may drop, but since we are dealing with an imbalanced dataset, we want to trade accuracy for correctly labeling the positive class.
confusion_matrix(y_pred,y_test)
array([[26055, 1051],
[10790, 2104]], dtype=int64)
ebm_global = ebm.explain_global()
The top 20 variables by importance can be found below.
ebm_imp_dct = ebm_global.data()
ebm_imp_dat = pd.DataFrame({'Variable':ebm_imp_dct['names'], 'Importance Score': ebm_imp_dct['scores']})
ebm_imp_dat.sort_values(by='Importance Score', ascending=False, inplace=True)
px.bar(x='Importance Score', y='Variable', data_frame=ebm_imp_dat.head(20), orientation='h')
Since EBM is a GAM, we can visualize the relationship between each independent variable and the target. Below are 3 of these plots.
fig = ebm_global.visualize(40)
fig.update_layout(yaxis_range=[-2,2])
fig.show()
fig = ebm_global.visualize(1)
fig.update_layout(yaxis_range=[-.25,.25])
fig.show()
fig = ebm_global.visualize(11)
fig.update_layout(yaxis_range=[-.3,.3])
fig.show()
ebm_local = ebm.explain_local(X_data[:100],y_data[:100])
One powerful feature of an EBM is the ability to see why an observation was assigned a label. Below are 3 examples of EBM's label justification. The first is a true negative. The applicant has great financial scores, and the price of whatever they are purchasing is a favorable amount as well, leading to a prediction of no default.
ebm_local.visualize(0)
The second is a true positive. The awful value on the 3rd financial score led to a prediction of default.
ebm_local.visualize(2)
In the 3rd example, although the financial scores looked bad and led to a default prediction, the loan applicant did not end up defaulting.
ebm_local.visualize(17)
I noticed that many of the relationships the EBM captured looked noisy. 2 examples can be seen below.
fig = ebm_global.visualize(75)
fig.update_layout(yaxis_range=[-1,1], xaxis_range=[0,100])
fig.show()
The default count vs score chart looks very odd. When you look at the density, you notice that there are few observations greater than 10 (in reality there are few greater than 4). Below is the same relationship but for a smaller range.
fig.update_layout(yaxis_range=[-.25,.25], xaxis_range=[0,20])
fig.show()
fig = ebm_global.visualize(4)
fig.update_layout(yaxis_range=[-2,2])
fig.show()
The count of children is another clear example of an odd relationship, more likely caused by the low number of observations at higher values than by a real trend. By default, EBMs require a minimum of 2 samples per leaf. I raise that number significantly to force the EBM to only create leaves backed by significant sample sizes.
# Fit tuned model
ebm_tuned = ExplainableBoostingClassifier(min_samples_leaf=1000)
ebm_tuned.fit(X_train, y_train, sample_weight=weights)
# Generate predictions
y_pred_tuned = ebm_tuned.predict(X_test)
acc_score = accuracy_score(y_pred_tuned,y_test)
print("Accuracy score: " + str(acc_score))
precision, recall, _, _ = precision_recall_fscore_support(y_pred_tuned,y_test)
print("Precision: " + str(precision[1]))
print("Recall: " + str(recall[1]))
Accuracy score: 0.7037
Precision: 0.6675118858954041
Recall: 0.16314199395770393
When we raise the minimum number of samples per leaf from 2 to 1000, we do not see a significant difference in accuracy.
ebm_global_tuned = ebm_tuned.explain_global()
fig = ebm_global_tuned.visualize(75)
fig.update_layout(yaxis_range=[-.25,.25], xaxis_range=[0,20])
fig.show()
Our default count relationship did not change much, so it is clearly driven by a large number of observations.
fig = ebm_global_tuned.visualize(4)
fig.update_layout(yaxis_range=[-.1,.1])
fig.show()
The relationship between count of children and default is much less volatile.